Optimal Monte Carlo Updating 
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Based on Peskun's theorem it is shown that optimal transition matrices in Markov chain Monte 
Carlo should have zero diagonal elements except for the diagonal element corresponding to the 
largest weight. We will compare the statistical efficiency of this sampler to existing algorithms, such 
as heat-bath updating and the Metropolis algorithm. We provide numerical results for the Potts 
model as an application in classical physics. As an application in quantum physics we consider the 
spin 3/2XY model and the Bose- Hubbard model which have been simulated by the directed loop 
algorithm in the stochastic series expansion framework. 
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Monte Carlo methods are nowadays used in almost ev- 
ery branch of science, offering exact results in a statistical 
sense or providing answers where other methods fail. Al- 
ready in statistical physics alone, Monte Carlo methods 
have been applied to a variety of models [lj . For many ap- 
plications, good algorithms have been devised and there 
exist now solutions to many problems that were initially 
untractable. A well known example is the critical slowing 
down in the neighborhood of critical points that has been 
overcome in both classical (by cluster algorithms @ J3) 
and quantum Monte Carlo (by the loop algorithms 
The need for better performing algorithms is clear: effi- 
cient algorithms lead to more accurate results at the same 
computational cost. Yet little that goes beyond common 
sense reasoning is known about why an algorithm is effi- 
cient or not, and within a chosen algorithm there is often 
additional freedom. 

We address the question of the efficiency of Markov 
chain Monte Carlo (MCMC) algorithms in terms of 
smaller error bars. We first touch upon the needed termi- 
nology as it is usually [l| understood in statistical physics 
from a practitioner's viewpoint. We show how optimal 
sampling enters into this discussion and comment on its 
implementation. Finally, we compare it with standard 
updating mechanisms for the Potts model and for the 
directed loop algorithm 0,0 m the stochastic series ex- 
pansion ||. 

In MCMC a transition kernel (matrix) T is set up 
and we will assume that we know the discrete weights 
Wi,...,W n (finite, computable set) of the invariant 
probability distribution W. If the following two condi- 
tions hold 



(i) normalization of probability, Tg = l,Vi 

(ii) reversibility (detailed balance), WiTij = WjTji, 



and the chain can connect any two states in a certain 
finite number of steps, then the Markov chain will con- 
verge to the invariant probability distribution (which 
will be W). The stochastic matrix T has as largest 
eigenvalue 1, while the other eigenvalues are sorted by 
— I < Xj < l,j = 2, . . . n. Strictly speaking, the condi- 
tion (ii) is a too strong [lj} condition to assure conver- 
gence of the Markov chain towards the invariant distribu- 
tion W, it suffices J2iWiTij ~ Wj, but the reversibility 
condition is widely used in practical applications. 

The Markov process correlates the measurements of 
the observables Q in consecutive steps. The variance 
<Jq on these correlated measurements is not equal to the 
variance <7q q obtained from uncorrelated measurements. 
Instead, Oq = lTi nt ^Qo\ q, in which we have introduced 
the integrated autocorrelation time [j| [Tl|. 



r m t,Q = -+J2 A Q( x(t) )- 

4=1 



(1) 



Stationary samples x^> at the Monte Carlo times t are 
obtained from the sampler while the normalized autocor- 
relation function Aq(x^) for the observable Q is given 
by 

Q[ ' (Q 2 (xW)) - (Q(xW))* ' [ > 

in which the ensemble average (. . .) is taken over i. We 
can now make a connection with the second largest eigen- 
value by 



sup T int 



Q 



I + A 2 
2(1 -A 2 ) 



(3) 



The following discussion focuses on the eigenvalues 
A2, A3, . . . to obtain a lower asymptotic variance for an 
observable Q, 
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A different question concerns the convergence of a 
probability distribution towards the invariant probability 
distribution. It is dominated by the second largest eigen- 
value in absolute value of the stochastic matrix, which 
can be different from A2 for non-positive operators, and 
would determine the required number of thermalization 
or burn- in steps. Note that non-reversible transition ker- 
nels can converge faster |l3j . 

The stochastic matrix has the dimension of the Hilbert 
space, and all algorithms consist of two different oper- 
ations in every Monte Carlo step: a limitation on the 
configurations that can be reached and secondly the ac- 
ceptance or the rejection of the transition to one of them. 
For instance, heat-bath updating (also called the Gibbs 
sampler ^j) in the Ising model with dimension LxL can 
in one step reach only L 2 + 1 different configurations from 
the current configuration which has weight W\. Among 
the L 2 new ones, it picks one at random and the tran- 
sition to this trial configuration with weight W% will be 
accepted according to , otherwise it remains in 

the current configuration. Note that all eigenvalues of 
the Gibbs sampler are positive pl| . 

We will now focus on this second step of the update. 
For the Ising model, there are only two different configu- 
rations that play a role. How should we choose the tran- 
sition matrix so that the asymptotic variance is smallest? 
A hint is given by the Peskun theorem , stating that 
if T A and T B both satisfy the conditions (i) and (ii) and 
if all off-diagonal elements of T B are larger or equal than 
the corresponding elements of T A , then T B will lead to 
a smaller asymptotic variance for all observables than 
T A , or equivalently, X A > X B . It then follows that the 
Metropolis-Hastings algorithm 01 is by construction the 
most effective sampler for the Ising model with random 
single site updates (and not the Gibbs sampler). For all 
possible stochastic matrices with dimension n = 2 the 
Metropolis transition matrix [lsf is given by 
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(•5) 



is obtained, 
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min(j^-, )■ Liu 19 : ] has applied this 
idea to the independence Gibbs sampler, and obtained 
a complete eigenanalysis for the resulting stochastic ma- 
trix. 

It is possible to repeat this Metropolizing procedure 
until all but one of the diagonal elements are zero. So, 
optimal transition matrices must have Tn = 0,i / n. 
Indeed, Frigessi et al. 20] have shown that the optimal 
transition matrix is of the form, 



T, 



Opt 



ffttt %K 
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with Vl = = (1 - Vi)j=^, ■ ■ ■• The eigen- 

values are given by 1,A2 = — 2/1 (the same as in Eq.©) 
, A3 = — U2,.... They are all negative and appear in 
an ordered way. This A2 has the lowest value that 
possibly can be obtained with respect to the probability 
distribution W, and with A2 determined, A3 is then the 
smallest possible third largest eigenvalue, etc. Note that 
a rescaling is at work here, the entries for the second 
row Tij ,j = 2,...,n are analogous to the first row apart 
from the rescaling 1 — > (1 — J/i ) ■ Eq.(EJ represents an 
optimal transition matrix over the entire Hilbert space, 
however, many situations of practical interest need to 
sample stochastic subprocesses. Within these, optimal 
sampling can only be achieved when all but one of the 
diagonal elements are zero. When Eq.([7|) is applied to a 
stochastic subprocess, we will call it the locally optimal 
algorithm. 



Here we have ordered the weights in ascending order. 
This non-standard way of writing Metropolis updat- 
ing shows however the key ingredients of its efficiency, 
namely that the chance of staying in the current configu- 
ration should be minimized and secondly that the second 
largest eigenvalue is A2 = —T^f' 



->Met _ Wi 

w 2 • 



Peskun's theorem implies an ordering of the weights. 
So let (for the remaining of the paper) -k\ < iT2 < 
. . . < 7r„ be the normalized weights in ascending order, 
7T; = sr W Lr ■ Peskun's theorem tells us that we can al- 

ways improve a transition matrix T by 'Metropolizing' it, 
T!j = . , Vj 7^ i. Applying this idea to heat-bath 



Potts model. As a first application, we consider the 
q = 4 Potts model in two dimensions. We are inter- 
ested in the dynamics of the Monte Carlo process. There- 
fore we consider a small lattice with single-spin updates 
only, and we do not want to use cluster updates @] 
here. So we will randomly select a spin, after which we 
have to make a choice between the four possible orien- 
tations that this spin can take. Although the random 
selection of a site and the single spin update both seri- 
ously violate the structure of the optimal stochastic ma- 
trix Eq.(UJ, the Peskun theorem still holds, meaning that 
the matrix Eq.([7|) (applied to the stochastic subprocess) 
still leads to a more effective sampling than heat-bath 
updating, = J2k\v k = This of course cannot 



updates, the following Metropolized Gibbs sampler(MG) cure the fact that the updating of a single spin still leads 
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FIG. 1: The decorrelation factor for the energy as a function 
of temperature /3 at constant interaction strength for the q = 
4 Potts model on a L = 4 x 4 matrix is shown for heat- 
bath updates and for the locally optimal ones. Simulations 
consisted of 4000 chains of one million steps each. 



to the divergence of the autocorrelation time when the 
temperature reaches its critical value (5 C . 

The decorrelation factor (2l| Oq ja\ q is defined as the 
ratio between the error bars obtained from a Monte-Carlo 
simulation with correlations between successive samples 
and the error bars one would obtain from the same num- 
ber of independent but identically distributed samples. 
In the limit of a large number of samples, the decorrela- 
tion factor becomes equal to twice the integrated auto- 
correlation time (Eq.JTJ). The decorrelation factor can 
accurately be estimated by running a large number of 
independent Markov chains. In Fig. ^ we see that the 
decorrelation factor is much smaller for the locally opti- 
mal algorithm than for the heat-bath algorithm. 

A small lattice of L = 4 x 4 has been chosen since 
the decorrelation factor scales with system size for 
single-site updates such that the difference between 
the two algorithms will be larger in absolute terms 
when using a small lattice. We clearly see a signif- 
icant difference between the locally optimal and the 
heat-bath transition probabilities. The Metropolizcd 
Gibbs sampler T MG (Eq.((BJ|) has the same A 2 as the 
optimal one (Eq.( 0), and the integrated autocorre- 
lation times differ only slightly for the 4x4 Potts 
model. Note that it is much easier to implement T MG 
than T° pt , and if in practice the stochastic matri- 
ces cannot be computed in advance and need to be 
recomputed at each step, it is recommended to use T MG . 

Directed Loops. The same reasoning also applies to 
quantum Monte Carlo. In the stochastic series expansion 
method |8| a Taylor expansion is applied to the partition 



function Z = Tr exp(—f3H), yielding 

00 /Dm 

Z =E^[ E E <<i| - S^IiaXial - ^ |*3> 

m=0 ' {i 1: ...S m } {&i,...,6 m } 

• ■ • (i m -i\ - H bm _ 1 \i m )(i m \ - H bm \ii), (8) 

where the Hamiltonian H is decomposed in a set of 
bond operators, H — ^2 b H b , and complete sets \ik),k — 
1, ... ,n have been inserted. In the first step, a diago- 
nal update is performed in which the expansion order 
m can change. The second, off-diagonal update mimics 
the idea of loop-type Q and worm |2^ | updates and can 
best be explained using a graphical interpretation. Ev- 
ery matrix element in Eq.([SJ) is called a vertex, at which 
the two sites of the interaction and an imaginary time 
are assigned. Every vertex has four legs: two incoming 
and two outgoing legs per site, corresponding to the par- 
ticles created and annihilated by H b . The legs of the 
vertices are connected by segments corresponding to the 
occupied sites. A worm is created in a arbitrary point in 
space-time by inserting a creation and annihilation op- 
erator on a segment. One of these operators is chosen 
to be the worm head and is mobile, while the other one 
is the worm tail and remains immobile. The worm head 
moves through configuration space and can change the 
type of the vertices, for instance from a diagonal to an 
off-diagonal vertex. The worm movement stops when the 
worm head bites into its own tail. The entire movement 
of the worm can then be regarded as a single loop update. 

Once the worm has been created, its movement is com- 
pletely determined by how it passes through and modifies 
the vertices. Suppose the worm head reaches a vertex at 
the leg left under (entrance leg), as in Fig.|2 The local 
configuration can then change according to the four pro- 
cesses bounce, straight, jump and turn, each with their 
proper weight Wi, i = 1, . . . ,n. For models with num- 
ber conservation, it is possible that one or more of the 
four processes cannot occur, so n can in principle be two, 
three or four. The bounce process is always possible but 
since it does not change the current vertex, it can be 
regarded as a waste of computer time. The worm head 
has to choose between one of the n processes, modifies 
hereby the current vertex and goes consequently to the 
next vertex along the segment that connects the current 
exit leg and the next entrance leg. The probability ma- 
trix Tij defines the transition from the entrance leg to 
the exit legs and hence completely determines the worm 
movement. We will now discuss several choices for this 
probability matrix Ty . 

Originally, the heat-bath updates @, = ^-, w ^ / , 

(solution A) were proposed. Secondly, other choices 
are perfectly possible: Syljuasen and Sandvik propose 
directed loops where the worm head has a preferred 
direction at the vertices in order to be more efficient 
than solution A. The rule of thumb is that the frequency 
of the bounce processes should be as low as possible. 
This inspired the authors of Ref. to numerically 
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FIG. 2: The four possible states that can arise when the 
worm enters the leftmost vertex at the leg left under in a 
number occupation basis (can also be a spin state) and for 
a system with particle number conservation. A single line 
means that the leg is singly occupied, a double line means 
double occupancy and a dashed line denotes that the leg is 
not occupied. The four processes on the right correspond to 
bounce, straight, jump and turn, from left to right. 
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minimize the trace of the probability matrix with 
respect to detailed balance, WiTij = WjTji (hereafter 
called solution B or the minimal bounce solution). 
They used a linear programming technique [24| for this 
purpose. Equivalcntly, one can say that this amounts to 
minimizing the sum of the eigenvalues of the transition 
matrix T^, min(A2 + A3 + A4). Thirdly, we propose to 
use the locally optimal probability matrix Eq.( 01 as 
transition matrix (solution C) for the scattering of the 
worm at a vertex. 

Spin 3/2XY model. In Ref. [2j| the directed loop al- 
gorithm was studied for the one dimensional spin 3/2XY 
model in an external magnetic field h, 

H = J Y, \(^ s 7 + s ^ s t) - h Y, s *> ( 9 ) 

where the first sum extends over nearest neighbors and 
J is an exchange interaction term. It appeared that so- 
lution B always gave shorter autocorrelation times than 
solution A, as can also be seen in Fig. [3] and in Fig. 0] 
where the integrated autocorrelation times for the uni- 
form magnetization and the energy are plotted. 

The authors of Ref. [23| also proposed to break detailed 
balance to fiWiTij — fjWjTji, with /; an extra deg ree of 
freedom at leg i. Using a linear programming [24( tech- 
nique they were able to further minimize the trace of the 
probability matrix. They found that this algorithm gave 
the shortest autocorrelation times for most values of the 
magnetic field, except around h « 0.8 where it behaved 
worse than solution B and, unexpectedly, even worse 
than solution A. They deduced that alternative princi- 
ples than minimizing bounces might exist. Furthermore, 
this algorithm needs to be used carefully, since it modifies 
condition (ii) with respect to the invariant distribution 
for the Green's function (af (O)oj(i)), although it is 
correctly weighted with respect to the invariant distri- 
bution for diagonal observables such as the energy and 
the magnetization. Note that it is also possible to apply 
Eq.(0| to the modified weights W[ = foWi in order to 
obtain the locally optimal algorithm in case one is not 



FIG. 3: Integrated autocorrelation times for the magneti- 
zation, Ti n t(M), as a function of magnetic field h for a spin 
3/2 AY model as in Ref. p^ . lattice size L = 64, inverse tem- 
perature (3 — 64. The integrated autocorrelation times are 
made loop size independent (normalized to two worms per 
update) so that the heat bath (solution A), minimal bounce 
(solution B) and locally optimal (solution C) algorithms can 
directly be compared. The precise definition of these algo- 
rithms is explained in the text. 



interested in the Green's function. Therefore we will not 
further compare this proposal with solutions A, B and C. 

We also addressed the spin 3/2XY model with solution 
C and the results for the integrated autocorrelation times 
for the magnetization and the energy are also presented 
in Fig. [31 and in Fig. We find a substantial improve- 
ment over solution A. Solution C is also much better than 
solution B for magnetic fields around h w 0.8, while for 
other magnetic fields the difference is smaller. This in- 
deed shows that minimizing bounces is not optimal for 
the spin 3/2XY model. Note that increasing the system 
size does not significantly change the ratio of the corre- 
lation times of solutions B and C. 

Bose-Hubbard model. We also present results for the 
Bose-Hubbard model [2(| in one dimension (units are as 
in Ref. 

H = -t J2 bib, + \uY, n i( n i ~ !) - E ( 10 ) 

The first term represents hopping with strength t of the 
bosons between nearest neighbors, the chemical potential 
is denoted by /x and the second term takes on-site repul- 
sion with strength U into account. In the Bose-Hubbard 
model, the diagonal weights are relatively much larger 
compared to the non-diagonal weights than in spin sys- 
tems. Again, as can be seen in Fig. |SJ the heat-bath 
updates (solution A) are outperformed, but for large on- 
site repulsion U the minimal bounce solution (solution 

B) is superior to the locally optimal solution (solution 

C) . 



5 



heat bath (A) i- 
min. bounce (B) L - 
locally optimal (C) 




50 
45 
40 
35 
30 

\ 25 
20 
15 
10 
5 




heat bath (A) 
min. bounce (B) 
locally optimal (C) 




5 
U 



FIG. 4: Analogous to Fig. El but now for the integrated au- 
tocorrelation time of the energy, n n t(E), as a function of the 
magnetic field h. 



As in the Potts model, we are guaranteed that solu- 
tion C is more efficient than solution A, but the Peskun 
theorem does not claim that solution C is superior to so- 
lution B, because choosing the lowest (local) A2 will not 
necessarily correspond to the lowest integrated autocor- 
relation time, Ti n t (Eq.J3>. Specifically, in case n = 2 
both solutions reduce to Metropolis updating Eq.JSJ). In 
case n = 3 solution B is of the form 



rpB _ 
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2LL(1 _ 
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a) m - Zka 



1 -a 
1 - 

2-7T3 — l + ^7Tl a 



(11) 



The linear programming technique |24j applied in solu- 
tion B will try to make a — 1 ~^ 3 if 7r3 < |, otherwise 
it will take a = 0. It will be the system parameters that 
determine whether T B or T c (Eq.Q for n = 3) per- 
forms better. Due to its structure it is also not possible 
to improve Tb by Metropolizing it. 

Also in case n = 4 both solutions B and C put all di- 
agonal elements in the (local) stochastic matrix zero, ex- 
cept for the diagonal element corresponding to the largest 
weight. We know from the Peskun theorem that this 
leads to an efficient sampling. As a counterexample, the 
Metropolized Gibbs sampler Eq.(El) has only one zero on 
the diagonal of the transition matrix and is systemati- 
cally outperformed by both solution B and C. 

The class of locally optimal stochastic matrices can be 
parameterized as 



T ■ — 





^X 
7T 4 



a 




I2y 

7T4 ■* 
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V 

z 

7T 4 



(12) 



with x = l-a-b,y= l-^a-c,z= and the 

' a 7T2 ' 7T3 7T2 

three free parameters a, b and c. Solution B will now try 
to minimize T44, or minimize a, b and c, under a number 
of constraints such as a + b < 1. This can be suboptimal 



FIG. 5: Integrated autocorrelation time for the density, 
Tint(n), as a function of on-site repulsion U for a one dimen- 
sional Bose-Hubbard model with fj, = 5,t = 1, lattice size 
L — 64 and inverse temperature j3 = L, Four loops were 
constructed in every update for the 'min. bounce (B)' and 
'locally optimal (C)' algorithms, while for 'heat bath (A)' we 
constructed 16 loops and multiplied the results by four af- 
terwards. Particle number cut-off is lowered at U = 3 and 
U = 8. The Mott phase sets in for U > 9. 



however, since the minimum will be found when one or 
more of the constraints are exactly met 24] . Suppose 
the minimum is found for a + b = 1, as it happens in 
the spin 3/2XY model for magnetic fields h « 0.8. Then 
the scattering from the least probable state to the most 
probable state is zero, and this clearly cannot be optimal 
and explains why solution C is better in Fig. |5J 

The overall conclusion is that the integrated auto- 
correlation times for both solutions B and C will be of 
the same order and roughly optimal. The important 
principle is that the diagonal elements corresponding 
to the lowest n — 1 weights should be zero. Some 
arbitrariness is still retained, but it does not seem 
possible to define how the remaining freedom should be 
chosen independently of the weights that occur in the 
updating process. When the diagonal and non-diagonal 
weights are of the same order, solution C is better, 
while for large diagonal weights solution B gives the 
lowest integrated autocorrelation times. Furthermore, 
we have a good argument why these solutions lead to 
a locally optimal sampling, based on the Peskun theorem. 

Conclusion. We have shown that the optimal transi- 
tion kernel for Markov chain Monte Carlo should have 
a zero diagonal, except for the diagonal element cor- 
responding to the largest weight (which can be large). 
We have presented results for the Potts model with ran- 
dom single spin updates and for quantum spin chains 
and the Bose-Hubbard model with the directed loop al- 
gorithms. They confirm the theoretical reasoning. Our 
results suggest a practical way to improve existing Monte 
Carlo methods. This could lead to significant gains in 
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efficiency in both classicai and quantum Monte Carlo. 
One could consider applications in the research fields of 
flat histogram methods [2a], the loop algorithm Q| 0, 
the worm algorithm j2^| and the fast updates in auxil- 
iary field quantum Monte Carlo [2l| [3(| and shell model 
Monte Carlo Furthermore, our results could shed a 
new light on the correlations in Glauber dynamics |32| . 
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